3_RNAseq
Master Medical Biometry/Biostatistics, Introduction to Bioinformatics, Medizinische Fakultät Heidelberg
The purpose of these exercises is to introduce you to the post-processing of RNA-seq data, in this case the differential expression analysis.
Provide answers to questions that are marked with ‘Q’.
The Data
Leishmaniasis, is a disease caused by protozoan parasites of the genus Leishmania and spread by the bite of certain types of sandflies. Samples from infected and healthy individuals were sequenced. You can find the article of this study here.
- The raw data (
fastqfiles) was downloaded from the ENA (European Nucleotide Archive) database, under the projectE-MTAB-4902.
These are the samples used:
| Run Accession | Type | ID |
|---|---|---|
| ERR1523942 | Normal | Ctrl1 |
| ERR1523943 | Normal | Ctrl2 |
| ERR1523944 | Normal | Ctrl3 |
| ERR1523945 | Normal | Ctrl4 |
| ERR1523946 | Normal | Ctrl5 |
| ERR1523947 | Visceral Leishmaniasis | VL1 |
| ERR1523948 | Visceral Leishmaniasis | VL2 |
| ERR1523949 | Visceral Leishmaniasis | VL3 |
- These samples were mapped towards the human genome
- Gene counts were generated with
HTseqand stored ingeneCounts.txt.
Download the file from here and unzip it.
Loading packages
There are many software packages for this, such as DESeq2 and edgeR. For today´s exercise we will use DESeq2, an R package that estimates variance-mean dependence in count data from high-throughput sequencing assays and tests for differential expression based on a model using the negative binomial distribution. DESeq2 uses the raw read counts per gene as input.
Start R in your local computer.
Select your working directory (where you have the
geneCounts.txt file) by clicking:
Session -> Set Working Directory -> Choose directory
This is a copy/paste tutorial so we can focus on the interpretation of the results; however it is important to understand what each line is doing, so if you have any questions, just raise your hand!
Load the following libraries:
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
## tapply, union, unique, unsplit, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
## Loading required package: ggplot2
## Loading required package: ggrepel
Let’s define the metadata for our samples, this will set the colors to be displayed in our graphs and define the grouping for the analysis:
# Defining the sample names
sampleNames <- c("Ctrl1", "Ctrl2", "Ctrl3", "Ctrl4", "Ctrl5", "VL1", "VL2", "VL3")
# Defining the sample grouping
conditionNames <- data.frame(condition = c(rep("Ctrl",5), rep("VL",3)))
# Defining the Control and the Visceral Leishmaniasis group for plotting
annotation_col <- data.frame(Type = c(rep("Ctrl",5), rep("VL",3)))
# Defining the colors by group for plotting
annotation_colors <- list(Type = c(Ctrl = "blue",
VL = "red"))
rownames(annotation_col) = sampleNamesNow we load the gene counts:
# Reading the sample information
setwd("C:/Users/marce/OneDrive/Desktop")
counts <- read.table("geneCounts.txt",
header = TRUE,
row.names = 1,
sep = "\t",
as.is = TRUE)
colnames(counts) <- sampleNames
head(counts)## Ctrl1 Ctrl2 Ctrl3 Ctrl4 Ctrl5 VL1 VL2 VL3
## ENSG00000000003 19 13 12 9 10 2 5 4
## ENSG00000000419 470 368 273 205 281 398 378 714
## ENSG00000000457 813 816 754 665 837 835 771 687
## ENSG00000000460 213 180 112 110 94 233 225 209
## ENSG00000000938 18215 28421 24508 24414 29163 28888 13267 24022
## ENSG00000000971 26 30 91 57 61 47 121 57
Ctrl1 Ctrl2 Ctrl3 Ctrl4 Ctrl5 VL1 VL2 VL3
ENSG00000000003 19 13 12 9 10 2 5 4
ENSG00000000419 470 368 273 205 281 398 378 714
ENSG0000000457 813 816 754 665 837 835 771 687
ENSG00000000460 213 180 112 110 94 233 225 209
ENSG00000000938 18215 28421 24508 24414 29163 28888 13267 24022
ENSG00000000971 26 30 91 57 61 47 121 57
And construct the main data object. The
DESeqDataSetFromMatrix function constructs a DESeq2 data
set object integrating the counts, the samples description and the
experimental design:
# Creating the DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = conditionNames,
design =~ condition)## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
## class: DESeqDataSet
## dim: 6 8
## metadata(1): version
## assays(1): counts
## rownames(6): ENSG00000000003 ENSG00000000419 ... ENSG00000000938
## ENSG00000000971
## rowData names(0):
## colnames(8): Ctrl1 Ctrl2 ... VL2 VL3
## colData names(1): condition
Quality control and dimensional reduction
One of the first things that we need to do is to assess the overall similarity between the samples. We can create a PCA plot:
## using ntop=500 top features by variance
Or we can look at the Euclidean distance:
# Calculating distance among samples
sampleDist <- dist(t(assay(rld)))
sampleDistMtx <- as.matrix(sampleDist)
rownames(sampleDistMtx) <- sampleNames
colnames(sampleDistMtx) <- sampleNames
# Plotting the distances
# Select between Blues, Reds, Purples, Oranges or Greens
hmcol = colorRampPalette( rev(brewer.pal(9, "Purples")) )(100)
pheatmap(sampleDistMtx,
# cluster_rows = sampleDist,
# cluster_cols = sampleDist,
main = "Sample distances",
fontsize_row = 12,
annotation_col = annotation_col,
annotation_colors = annotation_colors,
col = hmcol)Q1. Is the clustering consistent with our different groups?
Now we can proceed with the differential expression testing. The
DESeq function does all the hard work, it basically
performs three steps:
- Compute a scaling factor for each sample to account for differences in read depth and complexity between samples
- Estimate the variance among samples
- Test for differences in expression among groups
Differential expression
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## class: DESeqDataSet
## dim: 6 8
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(6): ENSG00000000003 ENSG00000000419 ... ENSG00000000938
## ENSG00000000971
## rowData names(22): baseMean baseVar ... deviance maxCooks
## colnames(8): Ctrl1 Ctrl2 ... VL2 VL3
## colData names(2): condition sizeFactor
## log2 fold change (MLE): condition VL vs Ctrl
## Wald test p-value: condition VL vs Ctrl
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 9.14582 -1.8138943 0.637905 -2.843516 0.00446188
## ENSG00000000419 374.42530 0.5802608 0.279446 2.076470 0.03785052
## ENSG00000000457 770.53260 -0.0726711 0.200377 -0.362671 0.71685048
## ENSG00000000460 168.28316 0.6359161 0.269238 2.361909 0.01818110
## ENSG00000000938 23847.51068 -0.2688650 0.306979 -0.875843 0.38111551
## ENSG00000000971 62.98197 0.4261796 0.556256 0.766157 0.44358272
## padj
## <numeric>
## ENSG00000000003 0.0392143
## ENSG00000000419 0.1601584
## ENSG00000000457 0.8627233
## ENSG00000000460 0.1004275
## ENSG00000000938 0.6296845
## ENSG00000000971 0.6831105
Q2. How many genes have you tested? use
nrow(res)
## [1] 26171
You can create a histogram of the padjusted and an MA-plot to inspect if there is any enrichment of DE genes:
# Padjusted histogram
hist(res$padj,
breaks = 100,
col = "green",
border = "darkgreen",
main = "padjusted distribution",
xlab = "padjusted")Q3. What does the histogram tells you? What are the highest and lowest fold changes (approx) found in our dataset?
## [1] -7.525772
## [1] 7.732408
There are genes that have very low counts across all the samples,
these can’t be tested for DE so they end up with an NA
p-adj. Let’s remove them:
## [1] 6660
Q4. How many genes were removed?
## [1] 19511
We can now explore our data a little bit more. We can focus on genes that are significant according to some criteria: false discovery rate (FDR) or log fold change. For example, we can filter the results so that 1% are expected to be false positives (genes with no actual difference in expression):
## log2 fold change (MLE): condition VL vs Ctrl
## Wald test p-value: condition VL vs Ctrl
## DataFrame with 2447 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 9.14582 -1.813894 0.637905 -2.84352 4.46188e-03
## ENSG00000001084 684.88455 0.907674 0.212682 4.26776 1.97450e-05
## ENSG00000001497 688.87295 -0.771775 0.185793 -4.15395 3.26782e-05
## ENSG00000004399 1987.64309 -1.156232 0.375515 -3.07905 2.07658e-03
## ENSG00000004468 2498.66712 1.823404 0.372718 4.89219 9.97213e-07
## ... ... ... ... ... ...
## ENSG00000272620 10.53923 -2.20575 0.757704 -2.91110 3.60154e-03
## ENSG00000272742 5.09606 4.40812 1.170485 3.76606 1.65844e-04
## ENSG00000272763 17.72455 2.38666 0.561591 4.24982 2.13938e-05
## ENSG00000273149 899.49876 2.67861 0.438745 6.10515 1.02703e-09
## ENSG00000273442 2.66793 -4.58895 1.600247 -2.86765 4.13531e-03
## padj
## <numeric>
## ENSG00000000003 3.92143e-02
## ENSG00000001084 7.16238e-04
## ENSG00000001497 1.07882e-03
## ENSG00000004399 2.33523e-02
## ENSG00000004468 6.66323e-05
## ... ...
## ENSG00000272620 3.37997e-02
## ENSG00000272742 3.62756e-03
## ENSG00000272763 7.58936e-04
## ENSG00000273149 1.94547e-07
## ENSG00000273442 3.70961e-02
Q5. How many significant genes do you have?
## [1] 19511
To see the top 20 most upregulated genes:
## baseMean log2FoldChange lfcSE stat pvalue
## ENSG00000174358 127.930347 7.732408 1.3790641 5.606997 2.058675e-08
## ENSG00000204644 26.715322 7.418433 2.2926147 3.235796 1.213041e-03
## ENSG00000178752 343.840573 6.790581 0.7617054 8.914970 4.879473e-19
## ENSG00000268460 14.599232 6.536879 1.2171300 5.370732 7.841764e-08
## ENSG00000084734 6.307033 6.501329 1.2245724 5.309060 1.101919e-07
## ENSG00000251095 8.936396 6.404727 1.3896083 4.609016 4.045787e-06
## ENSG00000124749 5.803896 6.388858 1.5923904 4.012118 6.017648e-05
## ENSG00000206178 8.390654 6.310398 1.4684681 4.297266 1.729179e-05
## ENSG00000119630 22.893879 6.209982 1.1969620 5.188119 2.124284e-07
## ENSG00000237568 11.507078 6.184514 1.2234873 5.054825 4.307860e-07
## ENSG00000196565 36493.660290 6.094282 1.1513928 5.292965 1.203489e-07
## ENSG00000133742 5093.855886 6.085722 0.6552484 9.287656 1.577231e-20
## ENSG00000182704 6.053909 5.836124 1.7722503 3.293059 9.910378e-04
## ENSG00000224091 5.961097 5.806018 1.3013471 4.461545 8.137103e-06
## ENSG00000217416 3.846914 5.784767 1.4933008 3.873813 1.071458e-04
## ENSG00000073754 5.617134 5.729365 1.2323434 4.649163 3.332844e-06
## ENSG00000148734 7.916617 5.639529 1.3492045 4.179892 2.916475e-05
## ENSG00000087085 213.085623 5.525337 0.7295231 7.573902 3.621765e-14
## ENSG00000248227 3.061367 5.468280 1.5903005 3.438520 5.849037e-04
## ENSG00000230601 3.056478 5.452693 1.5320956 3.558977 3.723024e-04
## padj
## ENSG00000174358 2.608235e-06
## ENSG00000204644 1.600246e-02
## ENSG00000178752 1.586723e-15
## ENSG00000268460 7.886632e-06
## ENSG00000084734 1.048758e-05
## ENSG00000251095 2.055660e-04
## ENSG00000124749 1.741993e-03
## ENSG00000206178 6.563815e-04
## ENSG00000119630 1.786505e-05
## ENSG00000237568 3.362026e-05
## ENSG00000196565 1.139867e-05
## ENSG00000133742 7.693338e-17
## ENSG00000182704 1.385110e-02
## ENSG00000224091 3.641354e-04
## ENSG00000217416 2.652948e-03
## ENSG00000073754 1.748041e-04
## ENSG00000148734 9.844870e-04
## ENSG00000087085 2.826570e-11
## ENSG00000248227 9.517978e-03
## ENSG00000230601 6.795129e-03
kbl(as.matrix(res_sig[ order(res_sig$log2FoldChange, decreasing = TRUE),][1:top,]),
caption = "VL_vs_Ctrl Significant DE genes") %>%
kable_classic(full_width = F, html_font = "Cambria")| baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
|---|---|---|---|---|---|---|
| ENSG00000174358 | 127.930347 | 7.732408 | 1.3790641 | 5.606997 | 0.0000000 | 0.0000026 |
| ENSG00000204644 | 26.715322 | 7.418433 | 2.2926147 | 3.235796 | 0.0012130 | 0.0160025 |
| ENSG00000178752 | 343.840573 | 6.790581 | 0.7617054 | 8.914970 | 0.0000000 | 0.0000000 |
| ENSG00000268460 | 14.599232 | 6.536879 | 1.2171300 | 5.370732 | 0.0000001 | 0.0000079 |
| ENSG00000084734 | 6.307033 | 6.501329 | 1.2245724 | 5.309060 | 0.0000001 | 0.0000105 |
| ENSG00000251095 | 8.936396 | 6.404727 | 1.3896083 | 4.609016 | 0.0000040 | 0.0002056 |
| ENSG00000124749 | 5.803896 | 6.388858 | 1.5923904 | 4.012118 | 0.0000602 | 0.0017420 |
| ENSG00000206178 | 8.390654 | 6.310397 | 1.4684681 | 4.297266 | 0.0000173 | 0.0006564 |
| ENSG00000119630 | 22.893879 | 6.209982 | 1.1969620 | 5.188119 | 0.0000002 | 0.0000179 |
| ENSG00000237568 | 11.507078 | 6.184514 | 1.2234873 | 5.054825 | 0.0000004 | 0.0000336 |
| ENSG00000196565 | 36493.660290 | 6.094282 | 1.1513928 | 5.292965 | 0.0000001 | 0.0000114 |
| ENSG00000133742 | 5093.855886 | 6.085722 | 0.6552484 | 9.287656 | 0.0000000 | 0.0000000 |
| ENSG00000182704 | 6.053909 | 5.836124 | 1.7722503 | 3.293059 | 0.0009910 | 0.0138511 |
| ENSG00000224091 | 5.961097 | 5.806018 | 1.3013471 | 4.461544 | 0.0000081 | 0.0003641 |
| ENSG00000217416 | 3.846914 | 5.784767 | 1.4933008 | 3.873813 | 0.0001071 | 0.0026529 |
| ENSG00000073754 | 5.617134 | 5.729366 | 1.2323434 | 4.649163 | 0.0000033 | 0.0001748 |
| ENSG00000148734 | 7.916617 | 5.639529 | 1.3492045 | 4.179892 | 0.0000292 | 0.0009845 |
| ENSG00000087085 | 213.085623 | 5.525337 | 0.7295231 | 7.573902 | 0.0000000 | 0.0000000 |
| ENSG00000248227 | 3.061367 | 5.468280 | 1.5903005 | 3.438520 | 0.0005849 | 0.0095180 |
| ENSG00000230601 | 3.056478 | 5.452693 | 1.5320956 | 3.558977 | 0.0003723 | 0.0067951 |
Q6. Make a note of the genes. Change the
decreasingargument toFALSE, what type of genes do you get? Make also a note of these genes.
kbl(as.matrix(res_sig[ order(res_sig$log2FoldChange, decreasing = FALSE),][1:top,]),
caption = "VL_vs_Ctrl Significant DE genes") %>%
kable_classic(full_width = F, html_font = "Cambria")| baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
|---|---|---|---|---|---|---|
| ENSG00000142408 | 40.055401 | -7.525772 | 1.3864284 | -5.428172 | 0.0000001 | 0.0000062 |
| ENSG00000105366 | 245.062705 | -6.462171 | 1.4942735 | -4.324624 | 0.0000153 | 0.0006010 |
| ENSG00000138395 | 8.202745 | -6.210153 | 1.4382885 | -4.317738 | 0.0000158 | 0.0006090 |
| ENSG00000170558 | 26.778054 | -6.054004 | 1.2766064 | -4.742263 | 0.0000021 | 0.0001195 |
| ENSG00000103355 | 357.316605 | -5.935067 | 1.2072165 | -4.916323 | 0.0000009 | 0.0000604 |
| ENSG00000122733 | 24.171931 | -5.882075 | 1.3418652 | -4.383507 | 0.0000117 | 0.0004879 |
| ENSG00000215148 | 6.275105 | -5.825145 | 1.4864657 | -3.918789 | 0.0000890 | 0.0023183 |
| ENSG00000205927 | 83.380742 | -5.625245 | 1.4350377 | -3.919929 | 0.0000886 | 0.0023133 |
| ENSG00000261327 | 5.375587 | -5.598636 | 1.4030454 | -3.990346 | 0.0000660 | 0.0018619 |
| ENSG00000146122 | 122.607798 | -5.550336 | 0.9217390 | -6.021591 | 0.0000000 | 0.0000003 |
| ENSG00000239265 | 5.198719 | -5.546517 | 1.9439918 | -2.853159 | 0.0043287 | 0.0384690 |
| ENSG00000130433 | 82.227780 | -5.473095 | 0.9926391 | -5.513680 | 0.0000000 | 0.0000041 |
| ENSG00000130037 | 4.648468 | -5.389323 | 1.3911242 | -3.874077 | 0.0001070 | 0.0026529 |
| ENSG00000260105 | 4.659556 | -5.375267 | 1.4009836 | -3.836781 | 0.0001247 | 0.0029940 |
| ENSG00000254396 | 4.445459 | -5.321777 | 1.4053036 | -3.786924 | 0.0001525 | 0.0034324 |
| ENSG00000225449 | 4.215827 | -5.245733 | 1.7440426 | -3.007801 | 0.0026315 | 0.0272807 |
| ENSG00000161905 | 626.710813 | -5.191873 | 1.4454512 | -3.591870 | 0.0003283 | 0.0061534 |
| ENSG00000106078 | 3.678201 | -5.040158 | 1.4364829 | -3.508679 | 0.0004503 | 0.0078451 |
| ENSG00000187498 | 7.183593 | -5.034032 | 1.3375390 | -3.763653 | 0.0001674 | 0.0036545 |
| ENSG00000168269 | 3.522239 | -4.996943 | 1.5667296 | -3.189410 | 0.0014256 | 0.0179654 |
It is also important to have a graphical representation of these results. A heat map of the significantly DE genes can be generated as follows:
# Selecting the rlog-transformed counts of the DE genes
sig2heatmap <- assay(rld)[ rownames( assay(rld) ) %in% rownames(res_sig), ]
# Calculating the difference of the counts to the mean value
sig2heatmap_Mean <- sig2heatmap - rowMeans(sig2heatmap)
colnames(sig2heatmap_Mean) <- sampleNames
# Plotting
pheatmap(sig2heatmap_Mean,
main = paste("VL_vs_Ctrl Significant DE genes (", alfa, ")", sep=" "),
annotation_col = annotation_col,
annotation_colors = annotation_colors,
show_rownames = FALSE)Q7. What can you conclude from the heatmap?
You can of course plot the top 20 most significant genes:
# Selecting the top 20 genes based on adjusted pvalue
res_top <-res_sig[ order(res_sig$padj), ][ 1:top, ]
# Selecting the rlog-transformed counts of the DE genes
top2heatmap <- assay(rld)[ rownames(assay(rld)) %in% rownames(res_top), ]
colnames(top2heatmap) <- sampleNames
# Calculating the difference of the counts to the mean value
top2heatmap_Mean <- top2heatmap - rowMeans(top2heatmap)
colnames(top2heatmap_Mean) <- sampleNames
# Plotting
pheatmap(top2heatmap_Mean,
main = paste("VL_vs_Ctrl Significant DE genes (", alfa, "-", top, "genes )", sep = " "),
annotation_col = annotation_col,
annotation_colors = annotation_colors)Q8. What can you conclude from this heatmap? Are the plotted genes in your top 20 most up/down regulated present in this list?
And all the genes at the same time:
EnhancedVolcano(res, # data to plot
lab = rownames(res), # labels (protein name)
x = "log2FoldChange", # value to plot in the X-axis
y = "pvalue", # value to plot in the Y-axis
pCutoff = 0.001, # cut off for the pvalue
FCcutoff = 2, # cut off for the fold change
title = "VL_vs_Ctrl", # Title of the plot
subtitle = NULL, # Remove subtitle to create more space for the plot
caption = NULL, # Remove caption to create more space for the plot
legendPosition = "top", # Position the legend on top of the plot
axisLabSize = 11, # Set font size for axis labels
labSize = 2.8, # Set font size for protein labels
)